Sejong Korean Corpora in the Making

Authors

  • Beom-mo Kang
  • Hunggyu Kim
Abstract

The 21st Century Sejong Project is a comprehensive project that aims to build various kinds of language resources, including Korean corpora comparable to the BNC (Aston & Burnard, 1998) and Korean electronic dictionaries. The project was conceived in 1997 and started in 1998 as a 10-year long-term project. By 2003, we had completed six years of work. The Sejong Corpora are a collection of raw corpora of modern Korean (written and spoken), North Korean, Korean used abroad, old Korean, and oral folklore literature. They also include parallel corpora consisting of Korean and other languages such as English and Japanese. Among these, a morph-tagged corpus is a central part. In compiling these corpora we followed, to a certain degree, the suggestions of the Text Encoding Initiative (TEI; Sperberg-McQueen & Burnard, 1994). By 2003, we had compiled a modern Korean raw corpus of 57 million words. We have an additional 75 million words of already existing electronic texts, which were processed and standardized in the first year of the Sejong Project. These raw texts are mostly written Korean. We have a relatively small amount of spoken material, around 3 million words. The morph-tagged corpus is morphologically analyzed written Korean, amounting to around 10 million words by the end of 2003. The morph-sense-tagged corpus, which is the result of disambiguating morphs, has 5.5 million words. In 2002 we started to build a treebank, i.e. syntactically analyzed Korean sentences based on simple phrase structure grammar rules. Currently, only 0.15 million words are part of syntactic trees. The written corpora, i.e. a raw corpus of modern Korean, a morph-tagged corpus, a morph-sense-tagged corpus, and a treebank, have been compiled at the Center for Electronic Texts of Korea University. The following table is a summary.

Similar resources

A Tool for Korean Semantic Annotated Corpus Construction

Although a semantically annotated corpus is necessary for semantic role labeling, no such corpus has been constructed for Korean. This paper establishes a tool for the construction of a Korean semantically annotated corpus, including a Korean Proposition Bank (PropBank). The Sejong predicate case frame dictionary was used as one of the linguistic resources, and a Korean syntactic annotate...

The Xavier Module – Information Processing of Treebanks

This paper introduces the Xavier module, a program package for processing treebanks (in particular, the Sejong Korean Treebank). The procedure for implementing Xavier is discussed, and the main usage of the program is also described. Though this paper focuses on the Sejong Korean Treebank, Xavier is also applicable to other treebanks, such as the Penn Treebanks, because it has been ...

Extraction of Tree Adjoining Grammars from a Treebank for Korean

We present the implementation of a system which extracts not only lexicalized grammars but also feature-based lexicalized grammars from the Korean Sejong Treebank. We report on some practical experiments in which we extract TAG grammars and tree schemata. Above all, the full-scale syntactic tags and well-formed morphological analysis in the Sejong Treebank allow us to extract syntactic features. In addition, ...

Extracting Syntactic Features from a Korean Treebank

In this paper, we present a system which can extract syntactic feature structures from a Korean treebank (the Sejong Treebank) in order to develop a feature-based lexicalized tree adjoining grammar.

Publication date: 2004